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ABSTRACT 


In many remote sensing applications, the area of a scene sensed by a single pixel can often be measured 
in square meters. This means that many objects of interest in a scene are smaller than a single pixel in 
the resulting image. Current tracking methods rely on robust object detection using multi-pixel features. A 
subpixel object does not provide enough information for these methods to work. This dissertation presents a 
method for tracking subpixel objects in image sequences captured from a stationary sensor that is critically 
sampled spatially. Using template matching, we estimate the maximum a posteriori probability of the target 
state over a sequence of images. A distance transform is used to calculate the motion prior in linear time, 
dramatically decreasing computation requirements. We compare the results of this method to a previously 
state-of-the-art track-before-detect particle filter designed for tracking small, low contrast objects using both 
synthetic and real-world imagery. Results show our method produces more accurate state estimates and 


higher detection rates than the current state of the art methods at signal-to-noise ratios as low as 3dB. 
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EXECUTIVE SUMMARY 


In many remote sensing applications, the area of a scene sensed by a single pixel can often be measured in 
square meters. At this resolution, small targets of interest present a surface area to the sensor that is smaller 
than the total area covered by the pixel. This results in the pixel’s sensed intensity comprising both the target’s 
intensity and the background intensity in the area covered by the pixel. This means that subpixel sized objects 
will not present texture or edge features to the sensor, but only intensity features, and those features are mixed, 


in proportion to target surface area and pixel resolution, with the intensity of the background scene. 


Current state-of-the-art tracking methods rely on robust object detection using multi-pixel features such as 
edges, points, and textures. A subpixel object does not provide enough information for these methods to 
work. For applications from space, a subpixel target could be something as large as a Somalian pirate boat 
when imagery is taken from a high orbital altitude. With no method to track such small targets the only 
solution is to increase the focal length of the system, reducing the field of view of the sensor and increasing 


weight and complexity. 


This dissertation describes a method for tracking subpixel targets using the diffraction of light at the aperture 
of the sensor as a signal. Based on diffraction, templates are created representing the target’s expected 
projection in the image. These templates are compared to images from the sensor to develop a likelihood of 
target location. Using this likelihood and a target motion model, the maximum a posteriori probability of the 
target’s state is estimated over the sequence of images. The final result is a path representing the most likely 


sequence of locations for the target given the imagery and motion model. 


The performance of this tracker is tested using synthetic, quasi real-world, and real-world imagery. Results 
are compared to the current state of the art method for very low observable target estimation. Results show 
that our method outperforms the state-of-the-art at all signal-to-noise ratios down to 4dB. Additionally, greater 
than 80% of all estimates are within one pixel of the actual target position down to 3dB signal-to-noise, while 


the state of the art method achieves greater than 80% accuracy at 6 dB. 
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I. INTRODUCTION 


As military use of satellite and unmanned aircraft intelligence increases, the problem of developing 
actionable intelligence from sensor data is quickly becoming intractable. In January 2010, LtGen David Dep- 
tula, Air Force Deputy Chief of Staff of Intelligence, Surveillance, and Reconnaissance said “We’re going to 
find ourselves in the not too distant future swimming in sensors and drowning in data” [1]. Gen Cartwright, 
Vice Chairman, Joint Chiefs of Staff (JCS) estimates that 2,000 intelligence analysts are needed for each 
Predator drone fitted with next generation sensors [2]. As new advanced sensors, such as the Defense Ad- 
vanced Research Projects Agency (DARPA) autonomous real-time ground ubiquitous surveillance—imaging 
system (ARGUS—IS) sensor, are deployed this number is expected to increase. 

Satellite imagery offers a wide-area view that is critical for developing intelligence data. As a 
satellite passes over an area of interest, a high-resolution camera can be employed to provide a real-time 
visual of a section of the earth’s surface. Due to the height of the satellite, a trade-off must be made between 
detailed imagery for a very short period of time, or much less detailed imagery for a longer period of time. 
For example, the highest reported commercial optical resolution for a satellite in low earth orbit (LEO) is 
0.41m or 16in [3]. Given a human target at nadir with shoulder width of ~21in, this means that the target 
projection would occupy some parts of four different pixels of the image at the best possible resolution. If 
the same satellite were moved to a higher orbit to “stare,” the surface area covered by a single pixel increases 
dramatically, rendering the same human target as a small contribution of the total photon flux sensed by that 
pixel. While the higher altitude satellite can continuously view the same location, the lower satellite will only 
be able to gather more detailed information for a short period of time. 

Sub-pixel tracking of satellite imagery can be used to solve a number of commercial and military 
problems. For example, tracking allows planners to analyze tracks of pedestrians in open air locations. 
These tracks could be used to aid in planning pedestrian paths, crowd control measures, or evacuation routes. 
Military analysts can use tracking to retrace the paths of any pedestrian targets who approached an area where 
an improvised explosive device (IED) was recently detonated. 

Another advantage of subpixel tracking is reduced bandwidth usage. Tracking can be performed 
using lower resolution imagery to reduce transmission costs while still allowing analysis currently requiring 
much higher resolutions. Using the subpixel estimated path, more detailed information can be requested for 
the parts of the image that are of interest. This same method can be applied on-board the sensor to determine 
which sections of an image to store, and which to throw out. 

Unmanned aerial vehicle (UAV) technology can also benefit from sub-pixel tracking. Current 


Federal Aviation Administration (FAA) regulations do not allow unmanned aerial vehicles (UAVs) to be 


used in the national air space (NAS) unless they can perform “see and avoid” to the same level as a human 
pilot. This means that aircraft must be able to visually detect and track other traffic with the same level 
of ability as a human pilot. This means that an aircraft on a collision course must be detected and tracked 
through the aircraft’s ego-motion. For non-cooperating aircraft below 250 knots, the majority of traffic must 
be detected at 6167ft in order to maneuver to avoid a collision [4]. Given this distance, the instantaneous 
field of view (FOV) (i.e., the field of view, in radians, of a single detecting element of the sensor) required to 
detect the frontal area of the smallest expected traffic is 0.56 milliradians. On a standard electro optical (EO) 
camera with a field of view (FOV) of 60° and 3,648 by 2,736 pixels (10 MP), the IFOV is 0.28 milliradians. 
This means that a see and avoid solution must be able to detect and track aircraft occupying no more than 
2 pixels of a LOMP image in order to avoid them. If a larger margin of safety is required, the distance must 
be increased, reducing the number of pixels occupied by the target. A sub-pixel tracking solution makes this 


problem easier to solve provided that it is robust to error and has a very low false positive rate. 


This dissertation outlines the methods used to perform subpixel tracking for critically sampled, 
additive Gaussian noise optical sensors. Chapter II discusses prior work in the fields of tracking and subpixel 
target detection. Chapter III presents the methodology used in this dissertation to accomplish the goal of 
tracking subpixel targets in imagery. The experiments used to validate this methodology are detailed in 
chapter IV and the results of those experiments are presented in Chapter V. Finally, Chapter VI discusses the 


implications of the our experimental results, and outlines areas for future research. 


Il. PRIOR WORK 


The problem of tracking subpixel targets in critically sampled imagery has not been heavily re- 
searched. Similar problem domains, such as very low observable (VLO) targets offer insights into how to 
proceed, but methods developed for this field are sometimes ill-suited to subpixel targets. The task of tracking 
targets using noisy sensors requires knowledge of the optical characteristics of the sensor, the visual and kine- 
matic characteristics of the target, and a number of images captured while the target is moving. This chapter 
discusses prior work that enables the methodology detailed in Chapter II. The major research areas covered 
in this chapter are target tracking technology, state of the art methods for tracking VLO targets, methods for 


performing detection of subpixel targets, and background subtraction techniques. 
A. TRACKING AS INFERENCE 


Tracking is a form of estimation for non-stationary statistical distributions. Bar-Shalom defines esti- 
mation as “the process of inferring the value of a quantity of interest from indirect, inaccurate, and uncertain 
observations” [5]. In order to estimate a target’s path, we must infer both its location in multiple images of 
a video sequence and the motion that best describes those locations over the entire image sequence. This 
naturally decomposes the tracking problem into detection, finding features of the target in a single frame, and 
estimation, estimating the target’s state in each frame. These two problems are often solved iteratively with 


the solutions from the previous frames used to aid in the solution of the current frame. 


The tracking problem can be most generally formulated as 


State,+1 = f(State;, process noise, ) (1) 


Measurement, = h(State,;, measurement noise, ) (2) 


where the function f(-) is the motion model, describing the evolution of the state over a single time step, and 
the function /(-) maps a known or estimated state to the expected observation value [5]. Both noise values 
are assumed to have a known statistical distribution which can be either static or dynamic. The process noise 
results in increased uncertainty of the estimated state, while the observation typically, but not guaranteed, 
helps to reduce uncertainty. The goal of tracking is to provide the lowest error state estimate possible (in the 


least squared error sense) given the process and measurement noise distributions. 


The continuous distribution for state inference is 


PUK. 121.0) = f PO X02. 2)PO%.)AK (3) 


where the random variable representing target state at time k is denoted as X;, and the random variable for the 
sensor observation at time k is Z;. If the probability distribution for X and Z both follow a known distribution 


with a closed form solution, then exact inference can be performed on the target state. 


Under certain assumptions, the optimal posterior of the target state can be calculated exactly. For 
instance, Kalman filters, are optimal methods for recursively calculating the posterior density provided that 
it is Gaussian at every time step. In order to achieve this assumption, the motion and sensor models must 
be linear since a linear function applied to a Gaussian density results in a Gaussian density. While this 
assumption is restrictive, the Kalman filter is a popular tool that has been applied successfully in a large 
number of inference applications [6, 7, 8, 9, 10, 11, 12, 13, 14]. 

In tracking problems, the distribution for random variables X; and Z; typically depend on the prior 
distribution of X;_;. One of the most popular methods for representing dynamic distributions with dependen- 
cies on past states is through a Bayesian networks. In tracking, a specific type of Bayesian network, called a 
hidden Markov model (HMM) is used to represent the time varying distribution over X and Z. This model 


assumes that random variables follow three assumptions [15]: 


1. The Markov assumption: a state, X;, is independent of all other states in the model given the previous 


state, X;_1 


2. Stationary state transitions: the probability of a particular state transition is the same regardless of the 
time at which it occurs. For example, a transition from x; to x2 where x),x2 € X will have the same 


probability at time x as it does at time j 
3. Independent observations: each observation is assumed to be independent of any previous observation. 


These three assumptions allow for a simplified solution to Equation 3 conducive to the tracking problem. 
Tracking using an HMM falls into two major categories: single state inference, and most likely sequence 
inference [15]. The first is a form of maximum likelihood (ML) estimation on the HMM distribution, while 


the second is a maximum a posteriori probability (MAP) calculation on the distribution. 


1. Maximum Likelihood Estimation 


ML estimation can applied to HMMs to calculate the maximum likelihood of a state given obser- 
vations, or to calculate the maximum posterior distribution of parameters, given a set of observations [5]. In 
tracking, ML estimation is the most common method used to infer the target state at a given time. This form 
of estimation has been used for robot motion control [16], navigation [17], pedestrian tracking [18], pose 
tracking [19], and an array of other applications [20, 21, 22, 23]. 

In cases where the ML posterior distribution is differentiable, a closed form solution of the ML 


estimate can be calculated analytically. For discrete or non-differentiable state spaces, the forward algorithm 


can be used to calculate the state probability at a given time k using observations z;._,4 [15]. This algorithm 
relies on the Markov assumption to calculate states recursively from X,, through X; without the need to 


recalculate the entire sequence of states. 


P(Xies1 = xyz) = Y p(z1.k—1, Xe = 1) P(e, Xie = x; |Xo...&) (4) 
xEX 
The forward algorithm is the underlying mechanism for many of the tracking methods in the liter- 
ature. For example, in a particle filter, the product of all previous state probabilities is represented by the 
population of particles prior to resampling. The resampling step is the equivalent of performing the summa- 


tion in Equation 4. 


A drawback to using ML on an HMM is that the estimate at time k is never updated using future 
observations. This means that a set of estimated positions using the forward algorithm will not be the same 
as the most likely set of states over all observations. Adding a backwards step to the forward algorithm 
solves this problem. The new method, called the forward-backward algorithm, solves for p(X;|z1..;) where 
0 <k <t. In situations where the distribution can be calculated in reverse, this method produces estimates 
that are closer to the optimal state estimates, however, in applications where approximation techniques are 


required, such as particle filters, the backwards filtering distribution is often difficult to specify. 


One attempt at applying the forward-backward algorithm to a particle filter method is Junkun et al. 
[24]. This method uses a fixed lag in particle positions so that the particle weights can be calculated using 
future observations to improve the estimate. The overall complexity of the algorithm is O(tn”). To achieve 
faster running time, Junkun et al. implement a number of approximations to the filtering distribution such 
as local linearization, smoothing over only one step instead of more than one, and replacing a probability 
calculation with a simpler Euclidean distance calculation. Despite these approximations, the smoothed filter 


achieves root mean squared (RMS) error as low as 0.1 at signal-to-noise ratio (SNR) of 8.3dB. 


2. Maximum a Posteriori Estimation 


Rather than calculate single state estimates iteratively, the MAP estimate attempts to determine the 


probability of a sequence of states by selecting the sequence that results in the maximum posterior probability. 
P(X. ]Z1..1,X0) = argmax p(Z1...21X1...2)P(X1..1|Xo..1-1) (5) 
Lt 


Like ML, the MAP estimate is optimal as t + [5]. However, while ML can be performed “live” 
as observations are made, always providing the most likely current state, MAP requires the entire sequence 


of observations in advance. This introduces a lag that is often undesirable in real-world tracking applications. 


Additionally, MAP is discussed in the literature as an inferior method with regards to estimation quality. For 
example, Greig et al. [25] analyze the use of MAP in the reconstruction of binary images. The key finding is 
that MAP does not perform well for the given problem set, however, this has been taken to mean that other 
methods such as markov random fields (MRFs) or Gibbs sampling are superior to MAP [26, 27]. Likewise, 
Fox and Nicholls [28] compare a MAP estimator to minimum mean-squared error (MMSE) and find that 
MAP produces lower quality estimates. As a result, there are relatively few examples of MAP used in the 
tracking literature. 

Recent work by Tran and Yuan [29], Islam et al. [30], and Yan et al. [31] show that the MAP estimate 
can produce better results than MMSE and ML for certain problem domains. Each of these methods show 
better results for MAP estimation than achieved by MMSE or ML. This indicates that the general consensus 
of MAP as a less accurate estimator is based on misinterpretation of the findings in [25, 28]. Those findings 
claim that MAP performs poorly relative to MMSE for the specific problem domain. Interpretations that state 
MAP is an inferior estimator must specify the problem domains for which this is true. Problem domains 
that exhibit maximum modes in the optimum estimate are expected to perform as well as or better than other 
estimators. Based on the work in [29], MAP can be applied to certain tracking problems with superior results. 

If calculating MAP on a discrete state space, the Viterbi algorithm offers an optimal solution to 
Equation 5 [32]. The Viterbi algorithm is a dynamic programming method for calculating the MAP in O(tn*) 
time. The most common use of this algorithm is in the field of decoding convolutional codes [33, 34, 35], but 
the methodology is applicable to any HMM with discrete state values [36]. 

A number of modified Viterbi algorithms have been developed for use in different domains. For 
example, online Viterbi [37] uses a sliding window of observations rather than requiring that all observations 
are available at once. Additionally, the computation is modified to use constant memory within the win- 
dow. While this method sacrifices the optimality of the algorithm, it allows Viterbi to be used on an infinite 
sequence of observations. 

Other modifications, such as A* Viterbi [38] and Tree Viterbi [39], attempt to reduce the total 
number of paths expanded during calculation. Both methods limit the number of nodes searched through 
the use of graph search techniques. A* Viterbi uses a priority queue and heuristic to determine which state 
to expand in the trellis. Tree Viterbi, on the other hand, performs a depth-first search of the trellis and 
uses branch cuts to limit search time. In both cases, the additional overhead of data structures and decision 
points can overshadow the gains of expanding fewer nodes in cases where hardware vectorization and cheap 
transition cost calculation accelerate the standard Viterbi algorithm. 

Lazy Viterbi [40] attempts to speed up computation of A* Viterbi by limiting the use of the priority 
queue. This results in Viterbi search with lower bound of O(1) and worst case of O(tn”). For estimation 


problems where the observation length is known a priori, and an admissible heuristic exists to guide the 


expansion, lazy Viterbi has been shown to speed up the A* calculation by as much as 50%. As with A* 
Viterbi, lazy Viterbi requires more implementation overhead. Depending on the cost of computing transitions 
and expanding nodes, the standard Viterbi algorithm can sometimes outperform the lazy method due to 


vectorization and compiler optimizations. 


B. VERY LOW OBSERVABLE TARGET TRACKING 


Very little data exists in the literature regarding subpixel target tracking. In order to compare tracking 
results with published data, it is necessary to look to similar challenging tracking scenarios that have received 
more attention. The VLO tracking field is similar to subpixel target tracking in many ways. A subpixel target 
exhibits low SNR in a given frame, while the VLO problem specifies very low SNR as the major obstacle 
to tracking. Additionally, a subpixel target provides very few pixels of data relative to the overall scene. 
While VLO tracking assumes that targets are larger than a pixel, most applications focus on targets between 
3 and 10 pixels across the largest dimension. This very low pixel count and low SNR makes tracking a 
difficult problem. These same difficulties exist in the subpixel problem domain to a greater degree. Given the 
similarity between VLO and subpixel domains, we review the VLO methods for possible application to the 


subpixel problem domain. 


A VLO target is one that has a very low peak SNR in the sensor data. While target size is not 
necessarily specified in VLO applications, small targets are typical due to their low overall photon flux relative 
to a larger target. To reduce noise, many tracking algorithms threshold the data so that sensor returns below 
a certain intensity are not considered by the tracker. Since VLO targets have intensity values that are close 
to or below the expected noise values, it is usually not possible to achieve acceptable detection probabilities 


without incurring high false alarm rates. These high false alarm rates make tracking much more difficult. 


In addition to low detection rate, VLO targets are also affected by background clutter. Clutter in an 
optical sensor consists of background objects in the scene such as buildings, roads, trees, or signs and is 
considered separately from sensor noise. A number of methods have been proposed to increase the signal- 
to-clutter ratio (SCR) in infrared imagery such as morphological filters [41, 42], wavelet methods [43] and 
expectation maximization (EM) [44]. Background subtraction methods in computer vision can also be applied 
to reduce SCR by modeling and subtracting a structured background from a sequence of images. Section D 


discusses background subtraction methods and the advantages and disadvantages of these methods. 


A particular class of filters designed to track VLO are the track before detect (TBD) filters. These 
filters focus on very small, low SNR targets down to, but not smaller than, a single pixel and typically incor- 
porate the point spread function (PSF) in their detection methods [45, 46]. This class of filters is considered 


to be the state of the art for VLO tracking methods. 


The TBD filter attempts to track a target using multiple detection probabilities over time rather than 
tracking detections above a given threshold. This allows the filter to integrate motion information over time 
with multiple detection likelihoods in order to build a better estimate of a target’s location. Since the sensor 
model is non-linear, TBD methods use a particle filter algorithm to perform estimation [47]. Theoretical lower 
bound performance of these filters indicate the ability to perform with subpixel accuracy, however, while 
many applications achieve subpixel estimation accuracy, most applications in the literature fail to achieve the 
lower bound performance [45, 48]. Since TBD filters use particle filter methods, they tend to be relatively 
fast, with computation time proportional to the number of particles used. The number of particles, however, 
also relates directly to the ability of the filter to converge on the correct estimate. As the number of dimensions 


or the size of the domain for any dimension increases, the number of particles required increases [49]. 


The drawback of using TBD methods is that filter performance degrades rapidly as the SNR falls 
below 7dB. As demonstrated by Rutten et al. [47], the TBD method produces RMS position error of no less 
than one pixel at 6dB, and no less than three pixels at 3dB. In contrast, the Cramer-Rao lower bound (CRLB) 
predicts the theoretical performance of the estimator to be better than 107! pixels at 4dB [48]. While TBD 


methods perform close to the the lower bound above 7dB, their performance suffers below this level. 
C. SUBPIXEL TARGET TRACKING 


Samson et al. address the detection of subpixel targets in a fixed scene [50]. They provides a method to 
detect subpixel point targets using the sensor’s PSF as an identifying characteristic. Using matched filtering 
theory, a measure is defined to determine the likelihood that a given pixel in an image contains a subpixel 
object. This measure is used to perform a detection decision by applying a threshold to the likelihood values. 
The Neyman-Pearson criterion to is used to determine the optimal threshold value for a given false positive 


rate. The resulting detector is shown to be greater than 70% accurate for images with SNR=14dB. 


The primary contribution of Samson et al. is the generalized matched-pixel filter (GMPF). This filter 
calculates the estimate of the unknown target photon flux, @, and returns the minimum sum of squared error 


estimate for the template centered at each pixel in an image. The GMPF is calculated as: 


nas (s€)TRNe(g9)) | 
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The variable s* is a 9x1 vector representation of a 3x3 pixel template. This template is generated by calcu- 
lating the ensquared energy of each pixel assuming a target with photon flux @ is located in the center of the 
template. This template is calculated by convolving the sensor’s PSF with a Dirac delta function representing 


a point target with intensity @. The variable R represents the sensor’s noise covariance matrix, and k(-) is 


a function that returns a 9x1 vector representation of the 3x3 pixel patch with center pixel of (x,y) in the 
observation z. Since the sensor noise is assumed independent and identically distributed (IID), R becomes 


Ir where I? denotes an identity matrix with diagonal of size 9. 


The GMPF provides the likelihood that a subpixel target exists in any given pixel. It does not provide 
an estimate of the subpixel location of that target. Since GMPF assumes a template with the target located 
in the center of a pixel, objects offset from the center will produce a lower overall likelihood. In order to 
calculate the likelihood of a pixel in any subpixel position, Samson et al. propose an approximate likelihood 
ratio test (ALRT) that uses the trapezoidal rule to approximate the integral: 
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While the ALRT method provides a measure of pixel-wise likelihood, it does not provide an estimate 
of subpixel location. To determine subpixel location, a second ML estimate must be performed for each 
pixel likelihood exceeding the specified threshold. A large number of pixels falling above the threshold limit 
causes a high computational load. Additionally, the subpixel position estimate does not incorporate any prior 
location probabilities or data from multiple observations over time. In order to use this method for tracking, 


it is necessary to specify a tracking method that can use detections provided by the ALRT method. 
D. BACKGROUND SUBTRACTION 


The background of a scene consists of all of the objects that are not necessary for tracking the target 
of interest. Background objects may be moving, but are generally assumed to be stationary with only small 
movements relative to the motion of the target. To eliminate spurious clutter in the estimation process, it is 


advantageous to subtract the background from the scene before performing inference. 


A variety of background subtraction techniques can be found in the literature. For this investigation, 
we focus on techniques that assume a mostly static background with the majority of changes between images 
occurring as the result of foreground objects moving through the scene. This is based on the assumption 
that most targets of interest will move with a speed that much higher than the surrounding background. For 
example, changes in illumination, cloud cover, and texture over a sequence of images will typically be much 


slower over a given time window than the changes caused by a vehicle or person moving through the scene. 


According to Cheung and Kamath, background subtraction typically follows the steps of preprocess- 
ing the image, modeling the background, detecting foreground objects, and data validation, with foreground 
masks as the final outcome [51]. Bouwmans labels these steps as modeling, initialization, maintenance, 


and foreground detection [52]. For each background subtraction technique, the main concerns that must be 


addressed are large-scale changes in illumination over time, shadows cast by moving objects, illumination 
changes due to environment such as snow, rain or cloud cover, and objects that merge into the background 
for a time, then return to the foreground (e.g., a car that parks for a period of time could be considered back- 
ground until it pulls out and starts moving again). Each of these problems can be addressed using a number 
of methods with varying strengths and weaknesses. The challenge is to determine which method provides the 


proper balance of computational complexity and accuracy under expected conditions. 


For the subpixel tracking problem, we assume a stationary, monochromatic sensor. Images consist of 
luminance values only. The targets we wish to track are generally slow moving, that is, less than a pixel of 
motion per frame, with a time delay between images that is sufficient to capture the target motion. Due to 
bandwidth and utilization constraints, it is not possible to obtain large numbers of images of the same scene, 


so methods that use relatively few images (<100) in a sequence are desirable. 


Bouwmans et al. identify 17 situations that cause basic background modeling techniques to fail [52, 
53, 54]. Of these situations, the ones of greatest concern in our application are noise, gradual illumination 
changes, bootstrapping (initialization or training on the same set of images that subtraction is required for), 
camouflage (objects with similar color or texture as the background), and multimodal background. Using 
these situations, methods proposed in the literature to deal with one or more of these methods were reviewed 


for use in our application. 


1. Basic Background Subtraction 


The simplest method of background subtraction is frame differencing. An image at time ¢ is sub- 
tracted from an image at time f — 1, and any pixel with a value greater than some threshold, T, is labeled as 
foreground. This method is fast, requires little memory, and can be performed in real time. On the other hand, 
selection of foreground objects is very sensitive to the selection of t. It must be set sufficiently high to filter 
out sensor noise, but in cases of shadows or highlights, this high setting could cause errors. This method also 
registers small background movements as foreground, such as leaves blowing in the wind, wave action, and 


cloud movement. 


The running Gaussian average proposed by Wren et al. is only slightly more complex than frame 
differencing [55]. This method calculates the per-pixel intensity distribution, VW ({1,,0;), over a period of k 


frames. A pixel is labeled as foreground if 


Ii — Lr| > Yor (8) 


where J; is the image at time f and y is a parameter chosen based on the desired sensitivity of the model, 


typically set to 2.5. The model is updated based on a weight, @, that determines how quickly and o 
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changes based on new observations. This so-called adaptation rate allows the model to be tuned for fast or 
slow background dynamics. A high value for & incorporates slow moving objects into the background model 
while a low value more likely classifies small changes as foreground for a large number of frames while 
the model adjusts to incorporate those changes. The advantages of this method are very low memory and 
computation requirements. This method addresses noise and gradual illumination changes much better than 
simple frame difference. Since the values of @ and y are fixed a priori, fluctuations in the background result 
in increased misclassification. Additionally, waking and sleeping foreground objects, as well as shadows 
and highlights produce high error rates unless @ is set to allow for rapid adaptation. A high adaptation rate, 


however, results in slow-moving foreground objects being classified as background. 


Lo and Velastin recommend using the median instead of the mean for Equation 8, arguing that it 
results in a closer model match than the mean. Likewise, Cucchiara et al. claim that the median value can be 
used to construct a background model using a subsampled frame up to a factor of 10 times smaller than the 
original frame [56]. While the median filter produces better results, it has the same disadvantages as the mean 
filter with the added disadvantage that the median cannot be easily calculated as a running median without 
storing all n previous frames. Additionally, the use of o; is no longer a statistically rigorous test relative to 


the median, so a fixed threshold T is used instead. 


To address the cost of median filter while maintaining the superior performance relative to the run- 
ning mean, McFarlane and Schofield [57] propose an approximate median method. The approximate median 
is initialized by setting the first image in the sequence to the median image. For each new image, the median 
image is updated by incrementing or decrementing any pixel in the median image that is less than or greater 
than, respectively, the corresponding pixel of the current image. This results in an approximation of the me- 
dian that can be calculated on the fly. Results reported in [52, 57] show that the approximate median filter 


performs similarly to the median filter at a substantial saving in memory usage. 


2. Statistical Background Subtraction 


One of the main disadvantages of these basic filters is that they assume a single mode distribution 
of pixel intensities. While this assumption holds for many pixels in a scene, it is possible for multiple 
background objects to appear in the same pixel over time. For example, a cloud will have one intensity 
distribution, while the ocean surface beneath the cloud will have a different distribution. While the basic 
methods will adapt to changes in cloud cover over time, it is also possible to use a mixture of Gaussians (MoG) 
to represent the background intensities. The first use of the MoG method was proposed by Stauffer and 
Grimson [58]. Power and Schoonees reformulate Stauffer and Grimson’s work with several corrections and 


discuss the theoretical underpinnings of the MoG method [59]. 
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The basic MoG method uses multiple Gaussian distributions to model both foreground and back- 
ground objects. A mixture is initialized for each pixel using EM and a predetermined number of Gaussians, 
K (usually between 3 and 7). Each Gaussian is assigned a weight, @, and parameters 0, = {Lx,0,}. For a 


given image, the closest matching Gaussian, k € K, for each pixel is chosen as 
k = argmax (@¢p(Z|k, 8)) (9) 


Z is the sensor observation. 


Once the current pixel state is estimated, the foreground distributions must be segmented from the 
background distributions. Since the MoG estimates both distributions, Stauffer and Grimson provide a fast 
procedure for performing segmentation based on the ratio ®/o,. Background distributions are expected to 
occur frequently; therefore, having a higher @, value, and they are not expected to vary dramatically so a 
lower 0; value is expected. Based on these assumptions, Stauffer and Grimson rank the states for each pixel 
using the ratio criterion. The first B ranked states with accumulated weight greater than a threshold, 7, are 


selected as background distributions with all others labelled as foreground. 


b 
B= argmin (x mr) (10) 


k=1 


To maintain the pixel-wise distribution mixtures over time, the model is updated at each time step 


using: 
1 N 
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ri P(X, ®) 
where o denotes element-wise multiplication and ® = {@...@x,0...0x}. 


To increase the speed of their algorithm, rather than calculate the exact value for p(k|X;,®) in eqs 
11, 12, and 13, Stauffer and Grimson define a match to be any value within Ao of the mean of the one of 
the distributions with A = 2.5 as the recommended value. Power and Schoonees recommend using hysteresis 
thresholding instead by designating an upper (A*) and lower (A— ) threshold where pixels that lie between the 
two values are labeled background or foreground based on an 8-connected morphological matching function 


[59]. 
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The MoG approach to background modeling has proven very popular in the literature with a large 
number of modifications to the original algorithm. For example Kim et al. attempt to characterize the back- 
ground model using Laplacian distributions rather than Gaussians [60]. A number of papers have attempted to 
overcome a fixed value of K Gaussians through estimation [61], simulation [62], and hill climbing [63, 64]. To 
improve model initialization in sequences where a “clean” background cannot be easily obtained, researchers 


have applied EM [65], approximate EM [66], [67], QR decomposition [68], and optic flow [69]. 


Another area of intense research is the adaptation rate of the algorithm. In its original embodiment, 
the adaptation rate is statically set at the beginning of a run. In order to deal with rapid scene changes, 
illumination changes, or poor initial model estimation, the adaptation rate must be very fast. On the other 
hand, high adaptation rates result in foreground objects being incorporated into the background model if 
they do not change rapidly enough. To address this problem, the idea of using foreground counters has been 
proposed in several different settings [67, 70, 71]. While each method differs in implementation, the core 
principle exploited is to limit the amount of time a foreground object can stop before it is subsumed into 
the background model. This allows for a slow adaptation rate for illumination changes while selectively 


increasing the adaptation rate for foreground objects that stop moving for a period of time. 


MOG estimation has two major drawbacks: initialization priors and adaptation rates. In surveillance 
applications where it is relatively easy to obtain images free from foreground objects, these images can be 
used to initialize the model. For situations where it is difficult or impossible to obtain a sufficient number 
of “clean” background images, the algorithm exhibits large error rates during initial foreground/background 
segmentation with results improving over time. For remote-sensing applications, it is possible to develop 
background models using images taken over multiple passes of the target region, however, it is generally 


impossible to guarantee target-free images. 


The adaptation rate for MoG must be slow enough that foreground objects are not subsumed into 
the background model. This creates a limit to how well the background can be matched with a small number 
of frames. While the number of frames in a test sequence is rarely discussed in the literature, a demonstration 
video released by Power and Schoonees contains hundreds of frames in the sequence. If we are limited in the 
number of frames we can capture, for instance one image per orbit, it will be costly to collect enough images 


for the background model to be build with sufficient accuracy. 


Building on the popularity and performance of MoG background subtraction, Elgammal et al. devel- 
oped kernel density estimation (KDE) as a nonparametric method of estimating the background distributions 
[72, 73, 74]. In KDE a histogram of pixel intensities is constructed over a number of frames. This histogram 
is then convolved with a kernel, usually Gaussian, with a set variance, h, called the kernel bandwidth. The re- 
sulting distribution is used in a similar manner to MoG to segment background and foreground pixels. Since 


it uses a histogram representation for each pixel, KDE does not require predetermined or adaptive values for 
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K. It also does not require estimation of [, or o,. These advantages are offset somewhat by the need to use 
a set bandwidth size, h. As for MoG, KDE requires a large number of frames to develop a good background 


model. It also requires special selection criteria for bootstrapping applications. 


E. MEASURES OF PERFORMANCE 


Methods of measuring estimation quality are necessary to validate the methodology presented in 
Chapter III. The literature provides a number of methods for comparing the tracking algorithm performance. 
Ristic et al., and Bessel use RMS error of each state variable relative to the ground truth state [45, 75]. Kalal 
and Wu, on the other hand, present real-time error detection methods that attempt to solve the tracking prob- 
lem from the current time step to the initial distribution P(Xo|Zo) without the need to compare to ground truth 
data [76, 77]. As part of the performance evaluation of tracking and surveillance (PETS) workshop, Bashir 
and Porikli discussed a suite of metrics for evaluating the performance of various tracking methods [78]. 
This suite provides a number of standardized evaluation metrics covering both frame-by-frame and total path 


performance. 


The metrics used in [78] are based on a bounding box that fully encloses the target’s footprint in 
the image. By using a bounding box, a simple test of estimation quality is to determine whether or not an 
estimation position falls within the ground truth bounding box. Using this bounding box method, Bashir 
and Porikli evaluate the count of true positives (TPs), true negatives (TNs), false positives (FPs), and false 
negatives (FNs) on a frame-by-frame basis. Using these metrics, it is possible to construct sequence-wide 
measures of quality such a tracker detection rate (TDR), specificity, accuracy, false alarm rate (FAR), and a 


number of other measures. 


In addition to frame-based metrics, [78] introduces object-based metrics. These metrics use the entire 
estimated path of an object as opposed to individual frame-wise estimation points. For example, the path- 
based version of TP uses the mean Euclidean distance between the ground truth path and the estimated path 
for all overlapping time steps to determine whether or not a TP has occurred. As with the frame-based 
methods, Bashir and Porikli define the metrics TP, TN, FP, and FN for objects. Using these metrics, TDR 
and FAR are reported on a path-wide basis. Additionally, the metric object tracking error (OTE) is calculated 
as the RMS distance between the estimated object location and the bounding box centroid. The metrics 
developed in [78] provide a starting point for evaluating the quality of various tracking methods. Section A 


discusses modifications made to these methods to adapt them for subpixel tracking scenarios. 
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KE CONCLUSION 


The literature provides a number of methods that can be applied to the subpixel problem. While many 
of these methods do not apply directly to the problem domain, with small modifications, tracking techniques 
such as MAP inference using the Viterbi algorithm can be adapted to work with subpixel targets. The methods 
developed in Samson et al. [50] are particularly useful for the development of a likelihood function to apply 
to the MAP estimation. The various background subtraction methods, while not developed specifically for 
subpixel foreground objects, provide a means to remove spurious background intensities from the image, 
making the tracking task somewhat easier. Finally, metrics developed by Bashir and Porikli offer a starting 


point for measuring performance of the methods developed in Chapter IIT. 


15 


THIS PAGE INTENTIONALLY LEFT BLANK 


16 


Hil. METHODOLOGY 


This chapter details the pixel-matched Viterbi (PMV) algorithm for tracking subpixel targets. Sec- 
tions A and B define the target and sensors models used to calculate the target state likelihood. Section C 
discusses the motion model used for the experiments in Chapter IV. In Section 1, the Viterbi algorithm is de- 
fined for calculating the maximum a posteriori probability (MAP) of the target states given the target, sensor, 
and motion models. Finally, the distance transform algorithm used to achieve linear time complexity for the 


Viterbi algorithm is explained in Section 2. 


A. TARGET MODEL 


A target consists of any object of interest smaller than the size of a single pixel. The target is assumed 
to present a surface area to the sensor, as well as to reflect or emit some amount of radiation in the sensor’s 
spectral band. The amount of surface area and spectral response, that is, the amount of spectral energy 
reflected or emitted by the target at each point on its visible surface, is assumed to be unknown. The target’s 
intensity contribution, a, is determined by integrating the target’s spectral response over the total visible 
surface area. Ignoring the affect of the sensor point spread function (PSF), the contribution of the target 
to a given pixel intensity is the sum of the background’s total non-occluded intensity, and the target’s total 
intensity, @ + B. Since the target is subpixel, the value of @ is the only target parameter measured by the 
sensor. In fact, since both target visible area and spectral response are unknown, it is not possible to determine 
the size or spectral response of a subpixel target without making further assumptions about the target. Since 
a target area cannot be determined using intensity alone, it is possible without loss of generality to specify 


the target as a point source total intensity, a. 


A target’s signature is defined as 


T (x,y) = @6(x,y) (14) 


where 5(x,y) is the Dirac delta function at location (x,y). 


B. SENSOR MODEL 


A sensor consists of a set of sensing elements, called pixels, arranged in a grid format, such as a charge- 
coupled device (CCD) or complementary metal-oxide-semiconductor (CMOS) device. Sensor spectral range 
is used to determine the size of the PSF, and is easily specified for different sensor types. This means that the 


methods discussed in this chapter can be applied to infrared (IR), radio detection and ranging (RADAR), or 
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standard optical sensors provided that the sensing elements are designed to determine the photon flux at each 


pixel site during a fixed integration interval, and the sensor exhibits a PSF due to aperture affects. 


To use the sensor’s PSF as a target signature, the sensor must be critically sampled or oversampled. 
Since a critically sampled sensor represents the smallest number of pixels that can be used to accurately 
sample the PSF, the remainder of this work assumes a critically sampled sensor. As discussed in Chapter 
II, this means that the size of each photosite is calculated as the Nyquist rate necessary to sample the full 
bandwidth of the sensor. The maximum sensor bandwidth is determined by the PSF. According to the 
Rayleigh criteria, the maximum spatial frequency that a sensor can sample is determined by the distance 
from the maximum intensity point of the PSF to the first zero ring [79]. For an Airy disk, this distance is 
1.22AN, where A represents the mean of the sensor’s spectral frequency range, and N is the ratio of aperture 
diameter to focal length. According to the Nyquist sampling theorem [80], the optimal sampling rate, r,, 
for the PSF 2.444.N. By setting the size of the sensor photosite to !/;., the sensor is critically sampled. It 
is common practice to design satellite and optical sensors for remote sensing applications using this criteria 


[81]. 


1. Approximating the Point Spread Function 


The PSF of a sensor with a round aperture is an Airy disk. This function is calculated using the 


Bessel function of the first kind [79]. 


(15) 


24, (kasin @) \? 
kasin@ 


PSF(@) = Ip ( 
The variable k is 27/1, a is the diameter of the aperture, and @ is the angle between a line drawn perpendicular 
to the aperture through the aperture center, and the line drawn from the center of the aperture to the observa- 
tion point. The term kasin @ can be rewritten as 79/an, where q is the radial distance on the sensor plane from 
the line drawn perpendicular to the aperture through the aperture center. Figure | shows the relationship of 


these values. 


In Section 3, we describe the generation of templates for matching the target signal. A template is 
determined by calculating the total area under the PSF that is covered by a single pixel. In order to calculate 
these templates quickly, we approximate the PSF using a Gaussian function. The PSF is a a rapidly decaying 
sin function with full width at half-maximum (FWHM) of 1.028AN where A is the mean of the sensor’s 
spectral bandwidth, and N is the ratio of focal length to aperture diameter. While the tails of a Gaussian do 
not fall to zero like the PSF, a Gaussian with FWHM equal to that of the PSF is a close approximation. The 
FWHM of a Gaussian occurs at 


FWHM = 02V21n2 (16) 
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Figure 1: Illustration of the aperture geometry used to calculate the PSF. 


To approximate the PSF of a sensor, a Gaussian is specified such that the Gaussian FWHM is the 
same as the PSF FWHM. This can be accomplished by setting Equation 16 equal to the FWHM of the PSF 


and solving for o. 


1.028AN 
ae 17 
2V2In2 ~~ 


Figure 2 shows the Gaussian approximation overlaid on the PSF. 


2. Background Subtraction 


An observation from the sensor consists of four elements: the sensor’s PSF, the target contribution, 


the background contribution, and sensor noise. These are related by 
z=h@(T(x,y)+B)+a@ (18) 


where z is sensor output, 4 denotes the PSF, @ is the convolution operator, T(x,y) is the target model, 
B is a static function of the background scene intensity, and @ ~ -/(0,r) is independent and identically 


distributed (IID) for each pixel. 


Using Equation 18, the contribution of the target and background are combined in the final PSF- 
convolved scene. Since the background is assumed to be static, the estimation problem is simplified by 


removing it from the scene. Using the distributive property of convolution, Equation 18 is rewritten 


z=h® f(x) +h@B+o (19) 
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Figure 2: Comparison of Gaussian and PSF. 


Equation 19 separates the background contribution from the target contribution without the need to decon- 
volve the image first. This makes it possible to remove the background contribution from the scene using 
only estimates of the background determined from the observation frames. 

As discussed in Chapter II, a number of methods exist for performing background subtraction. Many 
of these methods require a large data set of background images in order to build a model. While a good 
background modeling choice improves the signal-to-clutter ratio (SCR) of a scene, a bad model can remove 
pixels necessary for target estimation. Based on the results of experiments described in Chapter IV, we chose 
to use the median filter method of background subtraction due to its simplicity, high speed, and relatively 


high F-score. The background subtracted observation is denoted 


z=z-B (20) 


3. Sensor Likelihood 


We formulate the likelihood measure using z. Due to the PSF, the target signal will be spread into 
neighboring pixels. This signal can be used to estimate the subpixel location of the target by comparing the 
expected intensity values of a pixel and its neighbors with the observed pixel values in a frame. To estimate 
the intensity values of a central pixel and its neighbors, we generate a template using the ensquared energy of 


the Gaussian approximation to the PSF. 
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Figure 3: Templates generated by a subpixel target at (—!/4,-1/4),...,(1/4,1/4). The dot indicates the actual 
target position for each template. 
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For square pixels with extents of [—0.5,0.5), the ensquared energy of a pixel is the fraction of the 


volume of the Gaussian sampled by an individual pixel and the total Gaussian volume. 


fees Sosexp (— aia j 


fe Jesexp (-§ me F) did 


EE(x,y,0) = 





(21) 





The denominator of Equation 21 is the inverse of the normalizing constant for a bivariate Gaussian distribu- 


tion. Since the pixel noise is IID, this can be further simplified. 


ts Bs a 
e+] .7.__ 20 
i / ae (- a ) awa = 2 (22) 


The numerator of Equation 21, however, does not have an exact solution. This is calculated using 





an adaptive Simpson quadrature technique as implemented in Matlab® 2011. 


For a critically sampled sensor with a target in the center of a pixel, a 3 x 3 template will sample 
99.92% of the total target energy. As the target moves to a comer of the center pixel (e.g., (—0.5, —0.5)), 
the template contains a total of 98.16% of the total target energy. Since the target energy is very close to the 
sensor noise at low signal-to-noise ratio (SNR) values, a larger template size does not increase the amount 
of target information enough to justify the increased computational cost of a larger template. Figure 3 shows 


the ensquared energy of the PSF over a 3 x 3 pixel template for a number of subpixel positions. 


The likelihood function is calculated by generating a set of templates with the target located in a 
discrete set of locations. Discrete positions are calculated from a uniform grid within a single pixel. A 
spacing parameter, p, determines how many subdivisions to use. The discrete subpixel locations are the set 
of points (i, j) € € x €, where € = {-!/2,-1/24 1/p,-1/242/p,...,-!/2+P-1/p}. For example, if p = 5, a total 
of 25 templates are generated for target locations (x,y) € p x p, with points p = {—0.5, —0.3, —0.1,0.1,0.3}. 

The likelihood of a subpixel target located at a given position within a pixel is calculated using a 
template matched filter. The matched filter is an optimal method for calculating the correlation between a 
finite template and an infinite input [82]. In the case of additive white Gaussian noise (AWGN), the matched 
filter in the two dimensional discrete case is the sum of squared error between the template, .7, and a window 


centered at a given pixel location, 7“ ). The size of the window, N, is the same as the template size. 


FT ozs) & y YF (u,v) —2(x+i,y+j))/? (23) 


i=—N j=—N 


Given a set of templates, S, for each template, 7 € S, the correlation is calculated over each pixel in z. Since Z 


is corrupted by noise, @, Equation 23 is rewritten as the least squares estimator using matrix notation. With a 
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slight abuse of notation, 7 and Z“ ) are used to represent column vector representations of the corresponding 


3 x 3 matrix used in previous equations. 
F oz) & JT R170) (24) 


Due to changes in scene lighting, camera white balance settings, shadows, and other changes in 
scene intensity, the template intensity and observation intensity will not match. This leads to a problem where 
a template is an exact match for a window centered at pixel (x,y) up to a proportional constant. Furthermore, 
since the intensity of the target in unknown, the template does not match the target intensity either. Samson et 
al. address this problem by calculating the maximum likelihood estimate of the target intensity for any given 
window [50]. 

For any given template, 7, the actual target contribution is @.7 and the noise contribution is @ ~ 


WV (0,R) (from Equation 18). From [50] the maximum likelihood estimate for a is 





FERN zy) 
a= FIRAF (25) 
Combining with Equation 24, the likelihood equation becomes 
3 2 
é (7 e)PRo (Z9)) 
(x,y) = eat : : = 
é (: i | a oe! | ) (FUD)TRITED ee 


The full state is specified for notational convenience. The target velocity is not used in the likelihood 


function; therefore, the likelihood of a target position is independent of the velocity. 


Since R is a diagonal matrix of IID noise r, Equation 26 can be rewritten as: 


Leng Fld oZe9) 
eer = [i cea )2aieze oe 


Using the correlation operator, o, in Equation 27 allows the likelihood to be calculated using existing, highly 





optimized code for cross-correlation operations. While normalized cross-correlation is more commonly used 
in template matched filtering, the normalization occurs for the template and image combination and does not 
apply to multiple templates. To allow for direct comparison of results between multiple templates and the 


image, only standard cross-correlation is used to calculate the likelihood. 
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C. MOTION MODELS 


The PMV algorithm works with both linear and non-linear motion models with additive Gaussian 
noise. To test the performance of PMV, we define the nearly constant velocity (NCV) model from Bar- 


Shalom [5]. The following provides a brief description of this model. 


The NCV model assumes that a target is moving at a fixed velocity with noisy changes in velocity. 
The state for this model consists of position and velocity in the x and y dimensions. To avoid ambiguity 


between position and an instance of the random variable X, the state values are denoted using i and j instead. 


ve 


x=| i Ai j Aj (28) 
The state update equation is given as: 
Xp =Axp_1 +N (29) 
With a linear transition matrix 
1 1 0 0 
0 1 0 0 
A= (30) 
0 0 1 1 
0 0 0 1 
The noise, 77, is additive Gaussian with covariance matrix Q. 
1 ~ 4V(0,Q) (31) 


fAk3 SAK? 0 0 
IAK2 gAk =O 0 
g-|7" * (32) 
0 O £Ak> LAK? 


0 O Ak? qAk 
Since 7) is a Gaussian, the motion probability for this model is: 


—1 


P(Xi|X;—1) = (2m) ~2|Q|~2 exp 2% AX)" OK —AXe-1) (33) 





While the NCV model is a linear model, it is also possible to use the Viterbi algorithm with nonlinear motion 
models provided they exhibit additive Gaussian noise. The methods explained in Section 2 explain how a 


linear or nonlinear motion model can be used. 
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D. SOLVING FOR MAXIMUM A POSTERIORI PROBABILITIES 


Given a hidden Markov model (HMM) representing the distribution of target state and observations 
over time, it is possible to calculate the most likely state given a set of observations and previous states 
(maximum likelihood (ML) estimation), or to calculate the most likely sequence of states to produce a given 
sequence of observations (MAP). The PMV algorithm performs the latter calculation. In Bayesian theory, 


the most likely sequence of states is calculated from the full joint probability, p(Z)..+,Xo...). For brevity, the 


subscript notation Xj, 1s used to indicate the sequence Xy,, Xn+1,--.,Xm- 
t 
P(Zi..0,Xo...1) = Pp (Xo) |] p(ZlXe) p(X Xe—-1) (34) 
k=1 


If the realized values for random variables Z,_, are known, and the distribution for Xp is known, then 


the inference problem requires solving for the values of X._; that maximize Equation 34. 


t 
MAP(X;.. 1) = argmax p( Xo) IT» (Zx|Xx) P(Xx|Xx-1) (35) 


The argmax operator can be moved inside the product by noting that the product in Equation 35 is 


only maximized when p(Z;|X;,) p(X;|X,_1) is maximized for each X;. Thus 
t 
MAP(X1....) = p(Xo) [ T argmax p(Zx|Xe) p(Xe|Xe-1) (36) 
k=1 k 


Equation 36 exhibits the two criteria necessary to use dynamic programming methods: optimal substructure 
and overlapping sub problems [32]. If the random variable X; is discrete, then the solution to Equation 36 
can be calculated in time O(t-|X|*) using the Viterbi algorithm (the |-| operator in this case refers to the 


cardinality of the domain for the random variable X) [32]. 


1. The Viterbi Algorithm 


Using sensor model likelihood in Equation 27 and motion probability in Equation 33, the MAP is 
calculated as 


t 
Xi. = p(Xo) | Targmax €(Z |X) p(Xx Xe) (37) 
k=1 


The naive implementation of this calculation requires O(n') time where n = p*wh, when each frame has size 
w xh. This implementation is computationally intensive for more than a few frames. Since the number of 
possible states has been discretized, the Viterbi algorithm can be used to optimally calculate the MAP in 


O(tn”) time instead [32]. 
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Figure 4: Example of the Viterbi Algorithm for a 4 state lattice over 4 time steps. The solid red arrows represent 
the maximum prior probability for each state. The thicker solid black arrow represents the sequence of 
states resulting the final MAP estimate. 


The Viterbi algorithm is a dynamic programming method that calculates the optimal path through 
an acyclic directional graph given node and edge costs [33]. The HMM representing each possible discrete 
target state can be viewed as an acyclic directional graph where each state at time ¢ connects to each state at 
time t+ 1. The node value for each state is calculated using Equation 27, and edge weights from f to t+ 1 
are calculated using Equation 33. The Viterbi algorithm calculates the MAP of the states by selecting the 
maximum inbound edge for each successive set of states. The states’ weights are then updated by multiplying 
the weight of the inbound edge with the current state likelihood. Outbound edges are calculated as the 
product of the motion model probability and the node weight. The maximum probability node in the last time 
sequence represents the final node of the MAP sequence. A simple backtracking step follows the maximum 
inbound edge for each node, providing the final sequence of states in the MAP. Figure 4 shows an example 


of the Viterbi algorithm for a simple 4 state lattice over 4 time steps. 


The state used in the motion model consists of both position and velocity. The sensor observation, 
on the other hand, only provides an estimate of the target’s position. This means that the velocity must be 
determined from the position estimates. For the NCV model, the new velocity is calculated from x;,and x;_1. 
Since discrete positions are used for the state, the possible velocity values are also discrete. In this case, the 


current velocity for a given x; is simply calculated from the prior x;_1. 


For non-linear motion models with additive Gaussian noise, the parameter space is sampled uni- 
formly to build a histogram of possible transitions. The maximum bin in the histogram is the maximum 
mode of the distribution, and the additive Gaussian noise results in a Gaussian distribution with mean at the 
location represented by the maximum bin. The distance transform (DT) algorithm discussed in Section 2 can 


be used for this motion model. 
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2. Distance Transform Optimization 


Assuming the noise in Equation 33 is additive Gaussian noise, the DT algorithm presented in 
Felzenswalb and Huttenlocher [83] reduces the computation time from O(tn?) to O(tn). The DT algorithm 
calculates the maximum or minimum manifold of a set of parabolic or conic functions over a given domain. 
The set of functions consists of tuples (h,k), where f(x) = a(x—h)* +k. The variable a determines the size 
and direction of the parabola. To apply the DT algorithm, the value of a must be constant over all parabolas 


in the set. 


The DT algorithm only works in one dimension, however, since Q is a Gaussian covariance matrix, 
it is separable (i.e., it can be expressed as the outer product of two vectors, v and h), and thus it is possible to 


run DT once for each dimension of the state—in this case the x and y axis of the target position. 


The distance transform algorithm calculates the maximum or minimum manifold of a set of parabolic 
or conic functions over a given domain. The set of functions consists of tuples (1,k), where each tuple repre- 
sents a parabola of the form 


f(x) =a(x-hy +k (38) 
The variable a determines the size and direction of the parabola. To apply the distance transform algorithm, 
the value of a must be constant across all parabolas in the set. 


The following is a proof that distance transform calculates the maximum manifold of p(X,_1)p(X«|Xx—1). 


This proof assumes that Q is separable. Based on this assumption, we only consider the first dimension of 





a ; : eT : gates as 1 (xm)? 
P(Xp_1)p(X_|X~_1) without loss of generality. This dimension has the distribution age ox? ( 20? , 
with [ Mi bo bp ba ia = Axg_1, and 0) = Qi). 

Theorem D.1. p(X¢_1)p(Xi|Xx_1) can be expressed as a set of parabolas {(hy,k,),(h2,k2),---,(Anskn)} 


with common a 


Proof. Taking the log yields: 
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Setting the values for a, h, and k results in Equation 38: 


-1 


a=— 
20? 


h=Wy 





k = log (p(X)) +log 


1 
,/2m07 


f(x) = 7g? M1) + og(p(%)) + log 





20? 


Since the Qj; is constant for each state transition, the value a = = is constant for all parabolas in 
1 


the dimension, satisfying the requirement for DT 


f(x) is in the form of the parabolic equation; therefore, p(X,_1)p(Xx|Xx—1) can be expressed as a 


set of parabolas 


Felzenswalb and Huttenlocher show that DT calculates the maximum manifold of a set of parabolas 


parameterized as {(/1,k1),(h2,k2),.--, (Mn, kn)} with common a 


The log operator is monotonic; therefore, the maximum over the set of f(x,) is the same as the 











maximum over p(X¢_1)p(X¢|Xx—1)- 





E. CONCLUSION 


This chapter provides the methodology developed to track subpixel targets with critically sampled 
optical sensors. Using the sensor PSF target templates are generated over a discrete set of points. The 
likelihood function is developed using cross-correlation and maximum likelihood methods. A motion model 
is defined using the NCV model. We describe the use of the Viterbi algorithm to calculate the MAP of the 
target path. Finally, the distance transform algorithm is applied to the motion model as an optimization to the 
general Viterbi algorithm. Chapter IV describes experiments designed to test this method against synthetic 


and real-world data sets. 
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IV. DESIGN OF EXPERIMENTS 


This chapter describes the experiments performed to validate the following claims regarding the 


pixel-matched Viterbi (PMV) algorithm: 


¢ Estimated paths are better with respect to root mean squared (RMS) error than track before detect 


(TBD) methods 


* The distance transform algorithm results in O(n) time with no decrease in solution quality as number 


of sensor pixels increase 


¢ PMV exhibits a higher tracker detection rate (TDR) than TBD 


In designing experiments to validate the claims made regarding PMV, we first defined metrics to compare 
performance between PMV and TBD. Next, we tested background subtraction to determine which method 
provides the best balance between computational load and accuracy. Using the chosen method, we then 
determined the affect of the tuning parameter p on estimation accuracy versus running time. Using the best 
value of p, we tested PMV and TBD against three different sets of imagery data. The methods used to 
generate each data set are explained in Section D. Estimation error for PMV and TBD for each data set are 


reported in Chapter V. 


A. MEASUREMENTS 


Results from experiments are reported using a modified form of metrics from Bashir [78]. Since the 
PMV algorithm does not attempt to make a detection decision, it is assumed that a target always exists in the 
imagery under test. Based on this assumption, true negative (TN) will always be 0. Likewise, values for true 
positive (TP) and false negative (FN) must be measured in terms of a bounding box around each ground truth 


state rather than as a detection measurement. 


Bashir uses a bounding box concept to compute TP and FN [78]. An estimate is counted as a true 
positive any time the estimated location of the object lies within a box that fully encloses the target. For 
subpixel objects, the bounding box can be determined by the Rayleigh criteria as 1.22A.N. For a sensor with 
re = 2.44, this means that the bounding box will be 2.44 pixels on the side. Using this measure, the upper 
left corner of the bounding box is X; — (1.22, 1.22), and the width and height of the bounding box are 2.44 
and 2.44 respectively. Any estimate that falls within this box is counted as a TP while estimates outside of 


the box are counted as FNs. 
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Using this definition of a bounding box we calculate the detection rate for a single path as the ratio of 
TPs to total frames, total number of framess (TFs). 
f tp 
Tracker Detection Rate = if (39) 


In addition to point-wise statistics, the total path error is used to describe how far the estimated path 


is from the actual path. The standard measurement of error is RMS. 


RMS = 2 


x (Xj; — Xx)? (40) 
While Equation 40 gives an easy to compare value for path error, it can be easily corrupted by large outliers 
and constant estimation bias. To eliminate outliers and correct for any constant bias, we also report the RMS 
error calculated for only the TP estimates, RMS—TP. This measure quantifies the quality of the estimator 
when it has converged on the true target. An estimator with a higher RMS—TP is providing a lower error 
estimate when the TP rate is equal. A high detection rate, coupled with a low RMS—TP, indicates better 


quality estimation than a low RMS with low detection rate. 


Constant estimation bias refers to paths that follow the ground truth direction relatively accurately, 
but are offset by some amount. These paths will result in very low detection rate and very high RMS despite 
tracking the general trend of the path accurately. To determine when this situation occurs, we calculated 
RMS—LR for each path. The RMS—LR measure determines the Ax, and Ay offsets that result in the least 
RMS error. 





1_ Fe 
RMS—LR = argmin i y (dist(X,; — (Ax, Ay), Xx))? (41) 
HOY VY kal 


If the value of RMS—LR is lower than the value of RMS, then the estimator exhibited a constant bias in the 


estimate. If the value of RMS—LR is equal to the value of RMS, then the estimator is unbiased. 
B. BACKGROUND SUBTRACTION 


Background subtraction is a necessary pre-processing step for both TBD and PMV. As described 
in Chapter II, many approaches have been developed for this problem, however, none were developed with 
subpixel targets in mind. Methods such as mixture of Gaussians (MoG) or Eigenbackgrounds have a high 
probability of classifying the tails of the target’s point spread function (PSF) convolved projection as back- 
ground. Additionally, different methods’ performance regarding local background movement is highly vari- 
able. Due to these challenges, we performed experiments on the median, approximate median, and mixture 


of Gaussians methods to determine which allows for the best overall estimation performance. 
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This experiment used synthetic imagery containing a known subpixel target and a background. To 
generate the background scene, we use the background subtraction data set developed by Brutzer et al. [84]. 
This data set provides synthetically generated scenery for each of the background subtraction challenges 
outlined in Chapter II. Specifically, we focused our tests on the data sets for camouflage, darkening, and 
night. 

The quality of various background subtraction methods has been measured in a number of different 
ways. We are interested in ensuring that the method chosen eliminates as much of the background as possible 
while eliminating as few of the pixels associated with a subpixel target as possible. This means ensuring 
the method chosen not only classifies the central pixel containing the target as foreground, but that it also 
classifies that pixel’s immediate neighbors as foreground, as those neighbors contain valuable information 
on the target position via the PSF. To ensure this criteria, we scored each method for precision and recall, 


placing a high weight to the importance of precision. 


To calculate precision and recall, we counted the number of TPs, false positives (FPs), and FNs for 
each test image. A pixel is counted as a TP if the method identified it as background and it was actual 
background. A FP is counted for each pixel that is identified as background but is a pixel containing a 
subpixel target, or one of the 8-connected neighbors of that pixel. A FN occurs when a pixel that is known 


background is not classified as background. 


To score the methods, we used a F-score criteria. We first calculate precision as 


tp 








p= (42) 
tp+fp 
and recall as 
tp 
r= 43 
tp+ fn Se 
and finally the F-score as 
poo (44) 
ptr 


Using this method, the median background subtraction method was determined to have the best overall 


desirable characteristics, and this method was used for all further tests. 


1. Data Set Modifications 


Since the data sets provided by Brutzer et al. [84] do not contain subpixel targets, we created a 

modified set for testing. Using the ground truth background images, a 7x7 circular target was inserted in 
T 

each image using a constant velocity motion with xo = 792 —2.74 586 —3.51 | . The intensity of 


the target was set to mean image intensity for each frame. This allows the target to fluctuate intensity with the 
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overall scene. The images with targets where then subsampled with each new pixel calculated as the mean of 
each 8 x 8 distinct window in the original image, resulting in a 100 x 75 pixel image. This subsampled image 


was then convolved with a Gaussian PSF with full width at half-maximum (FWHM)= 1/2V/21n2. 


Using the frames with targets, we generated two sets of data, one with static background, and one 
with dynamic background. The original ground truth data consists of a street scene with buildings, roads, 
traffic lights, and a tree. The tree is considered background, and its leaves blow in the breeze. The dynamic 
background data set uses the full ground truth frame to include the moving tree. The static background data 
set uses the dynamic data set with the tree cropped out of the frames. The crop rectangle for the static data 


set has upper left corner of (53,0) and lower right corner of (100,75). 
C. PIXEL DISCRETIZATION 


The PMV algorithm discretizes pixels into p x p subpixel locations. The quality of the solution and 
computation time are both affected by the value of this discretization. To determine the affect of p on the 
characteristics of the algorithm, we conducted a series of tests over varying values of p. The test data consists 
of synthetic imagery generated at 20, 10, and 5dB signal-to-noise ratios (SNRs) as described in Section 1. 


The value of p is expected to have the same effect regardless of SNR value. 


For each observation sequence, the tracker detection rate, RMS and RMS—TP were calculated, as 
well as the total computation time for each sequence. The expected result is that increases in the value p 
result in increased computation time and asymptotically decreasing RMS. The tracker detection rate should 


not be affected by changes in p. Section B provides results of these tests. 
D. DATA GENERATION 


To systematically test the quality of PMV on subpixel targets, we developed three different data sets. 
The first data set, consisting of synthetically generated image sequences, was used to test the algorithm over a 
series of SNRs ranging from 1dB to 20dB with known target states. The synthetic data set does not simulate 
a background. The second set of data, quasi real-world data, uses real-world imagery of larger than subpixel 
objects. This data is then processed to convert the targets to subpixel size. The final data set consists of 
real-world imagery of a known subpixel-sized target in a scene.The following sections describe how each 


data set was generated. 


1. Synthetic Data Generation 


Synthetic data was generated for three different image sizes, 30 x 30, 85 x 85, and 200 x 200. For 


each image size, sets were generated with SNR ratios ranging from 1dB to 20dB. The same ground truth 
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Simulated Noise xX 
Known Ground Truth xX xX 
Static Background x x 
Dynamic Background x x 
Non-simulated xX 





Table 1: Parameter values used for synthetic imagery generation. 


target motion was used for all generated data sets, regardless of size or SNR. To account for the affects of 


stochastic noise, a total of 20 individual sequences were generated for each combination of image size and 


SNR. For all images, the PSF is approximated using a Gaussian with o = aaa which approximates the 





PSF of a perfectly circular aperture and assumes that the pixel size is set to = = 0.4098AN. The value for 


AN is set to 1 for convenience. 


To generate a synthetic image sequence, the target signature is generated on a blank image using the 
ensquared energy (Equation 21). This simulates a Gaussian-convolved target signal located in a known loca- 
tion with no noise. Next, image noise is calculated for the desired SNR using the maximum target intensity 
value after convolving with the PSF. This calculates the noise for peak SNR over the image sequence, guar- 
anteeing that the SNR will never exceed the specified value, but due to intensity distribution over multiple 


pixels, the frame-level SNR can be less. 
o = max(EE(Xg))10 20 (45) 


Using the calculated value for o, for each pixel in the image we sample from ./ (0,07) and add the result to 
the pixel’s intensity. 
The ground truth used for all of the synthetic images is linear, starting at position (2.0, 2.0) and 


moving with velocity of (0.5,0.2) for a total of 54 time steps with At = 1. Target intensity is set to a@ = 1. 


2. Quasi Real-world Data 


The quasi real-world data consists of two data sets captured with an infrared (IR) camera. In both 
data sets, the captured targets are larger than subpixel. To simulate subpixel targets while still comparing to 
target ground truth, the maximum dimension of the target was determined by manual inspection. Given a 
maximum dimension, d, the image size is reduced by half by calculating the average intensity of each group 
of 2 x 2 pixels. The resulting image is halved again using this method. We perform a total of n halvings, such 
Z24,. This produces a lower pixel count image that is equivalent to capturing an image using a sensor 


that nt 


with a pixel pitch exactly 2” larger than the actual sensor pixel pitch. 
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The size reduction of the images removes the PSF of the sensor. In order to use the quasi real-world 


images, we convolve the final image size with a Gaussian filter with o = ae This results in an image 





with real-world target signature and background, but a simulated PSF. These sequences are used to test 
background subtraction methods on a reasonably challenging background. Additionally, since the target is 
large enough in the full size images, it’s center of mass can be determined manually, providing ground truth 


data for the targets. 


Two quasi real-world data sets are used, one provided by the Ohio State University IR database (data 
set 05, plane motion) [85], referred to as the “plane” data set, the other captured locally using a FLIR SC 
8200 mid-wave infra-red (MWIR) camera of a city street using a 50mm lens at f/4 aperture setting, referred 
to as the “car” data set. The plane data set consists of an aircraft flying away from the camera from above a 
cloud bank into the clouds. The clouds in this set are moving throughout the sequence, and present a dynamic 


background. This set contains a total of 491 images of size 80 x 60. 


In the car data set, images were captured of cars driving on a straight road. The images were 
captured from the top of a 4 story building, looking towards the street, which goes up a hill 1.75 miles away. 
The resulting perspective shows movement up the hill as vertical motion in the sequence. This sequence 
contains a cluttered, but mostly static background. The car data set contains a total of 301 images of size 


32 x 51. Figure 5 shows three frames from each of these sequences. 


3. Real-world Data 


Real-world data was captured using a FLIR 8200 MWIR camera with 50mm lens and f/4 aperture 
setting. According to the manufacturer, the pixel pitch of the sensor is 18um. The spectral range of the 


camera is 3 — 5m wavelength frequencies. This results in a PSF with FWHM of 


FWHM = 1.028AN 
= 1.028-4-4 


= 16.448 um 


The full width to first zero ring is 


FW = 1.22AN 
=1.22-4-4 


= 19.51um 


34 






Figure 5: Three frames for the “plane” and “car” quasi real-world data set. A box is drawn over the target in 
each frame. The top row contains frames 1, 245, and 491 from the plane data set. The second row 
contains frames 1, 150, and 301 from the car data set. All images have been contrast enhanced for 
clearer viewing. 








Using these numbers, the pixel sampling rate is FW/pitch = 1.36. This means the camera is undersampled for 
the PSF size, however, since the full width of the spot is greater than the pixel pitch, the target intensity is 


still spread to more than one pixel. 


For this data set, an IR emitter in the 344m wavelength is used as a target. The emitter is a SVF360- 
8M LED from Cal Sensors, Inc. The 14mm diameter sensor was placed on a vertical holder 97m from the 
camera. The LED was powered at 1.25V using a constant current circuit, with a 60% duty cycle, 10kHz, 
pulse width modulation (PWM) signal to control emitter intensity. To simulate linear motion, the emitter’s 


holder was attached to a Pioneer P3DX robot set to move at a constant rate of 0.4m/s. 


We attempted to estimate ground truth of the sensor location using a larger heat source on the robot 
that is offset a constant distance from the emitter. The manually estimated path, however, exhibited higher 
variability than either the PMV or the TBD estimated path. Rather than use the manual path as a ground truth, 
we instead compare the paths generated by the two methods to each other. Inspection of the image sequence 


with the overlaid path shows that both methods provide realistic estimates of the emitter’s path. 


To avoid adding multiple targets to the image, the final sequence of images used was cropped to 


remove the robot and heat sink from the test imagery. Figure 6 shows three frames from the test sequence 


. a F 


Figure 6: Frames 47, 87, and 126 of the real-world data set. A box is drawn over the target in each frame. All 
images have been contrast enhanced for clearer viewing. The emitter is located ~12 pixels above the 
bottom edge of the image. 


with the heat sink included, and three frames of the cropped sequence containing only the emitter. The 


real-world image sequence consists of 126 frames of 59 x 14 sized images. 
E. IMAGE SIZE VS. COMPUTATION TIME 


The size of an image is expected to impact the quality of the TBD method while having little to no 
impact on the quality of the PMV method. On the other hand, TBD is expected to show nearly constant run 
time over different image sizes, while PMV is expected to show linear run time. In the case of TBD, larger 
images result in less particle coverage of the possible distribution, reducing the quality of the estimate. Since 
the PMV method fully samples each pixel in the image, the estimation quality is expected to remain the same 
regardless of image size. The computational time for each method, on the other hand, is the opposite of the 
expected estimation quality. As the number of pixels increase, PMV’s running time is expected to increase 
linearly while TBD will remain constant for a given particle population size. To verify these expectations, we 
report the running time for each of the synthetic image data sets. Timing results are reported for each of the 


three types of image data set in their respective sections in Chapter V. 
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V. RESULTS AND DISCUSSION 


This chapter provides results and analysis of those results for the experiments described in Chapter 
IV. The performance of the median, approximate median and mixture of Gaussians (MoG) background sub- 
traction methods are discussed in Section A. Section B analyzes the effects of increased pixel discretization 
on the quality of the estimated path and the running time of pixel-matched Viterbi (PMV). In Section C, we 
compare the run-time performance of PMV and track before detect (TBD) for different input sizes. Sections 
D and E present the mean root mean squared (RMS) performance for each filter on the synthetic and quasi 
real-world sets, as well as the detection rate for each method over varying signal-to-noise ratio (SNR) values. 


Finally, Section F compares the paths produced by each method for a real-world target. 


A. BACKGROUND SUBTRACTION 


Figure 7 shows the precision-recall curves for the median, approximate median, and MoG background 
subtraction methods. The logarithmic axes are used to emphasize the relationship of these methods, however, 
the overall precision and recall values for the subpixel target are well below those reported in [53] for larger 
targets using the same background data. The F-score for each of the three methods is shown in Figure 8. The 


score for dynamic and static background is given for each method. 


Each of the three background subtraction techniques show strengths and weaknesses in the test data. 
In precision and recall, the approximate median and median methods both provide better results than the MoG 
method, however, the precision for all three methods fall below 10%, while recall is less than 2%. The poor 
recall values are due to the low SNR ratio of the target, as well as the small target area. In the test data set, 
the number of target pixels ranges from 4 to 9 pixels depending on the location of the target. The total image 
size is 75 by 100. Since background subtraction methods typically segment large foreground objects such as 
pedestrians and cars, even with a low peak SNR value, the large target area results in much more information 
to exploit versus the subpixel problem. While the camouflage data set attempts to measure the performance 
of the background subtraction methods with low SNR targets, the results presented in [53, 52, 54] have a peak 
SNR of 15.13 with an average area of 11,313 pixels, as opposed to a subpixel target with area < 1 pixels. 
The poor performance of MoG in this particular case is due to the similarity between target intensity and 
background intensity. In order to increase recall, a very small threshold value is necessary. This small value 


negatively impacts the precision of the MoG filter. 


Due to dynamic background elements in the data set, a low threshold value results in higher classi- 


fication of the background motion as foreground instead. This results in the low precision values due to a 
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Figure 7: Precision-recall curves for approximate median, median, and mixture of Gaussians background subtrac- 
tion with a subpixel target in the foreground. 


large number of incorrect foreground classifications. The MoG method, however, achieves a higher F-score 
than the two median methods. This is due to the ability to model each pixel with multiple possible intensity 
distributions, allowing MoG to better account for the dynamic, repetitive motion. Unfortunately, while MoG 
provides a better F-score, Figure 7 shows that the improvement is mostly in precision, while recall does not 


improve noticeably from the static background. 


The results indicate that the chosen background subtraction method must take into account the ex- 
pected dynamics of the background scene. For example, if a large At value is used, background can be 
expected to change in illumination and content to some degree over that interval. This requires a method that 
can adapt quickly to global scene changes. Likewise, dynamic background components will have a strong 
effect on the performance of both the background subtraction method and the target estimator. None of the 
methods tested perform strongly in all of these regimes, but each method has strengths that can be tailored to 


known scene conditions. 


The overall poor precision and recall scores for the tested methods given a subpixel target indicate that 
the subpixel problem is novel enough to warrant a background subtraction method that is tailored to suit the 
problem domain. Each of the tested methods were developed for use on multi-pixel targets with reasonable 
SNR values and large target areas. The subpixel target presents challenges that these methods do not attempt 


to address, such as low SNR and aperture diffraction effects. Additionally, each of these methods is tested as 
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Figure 8: Comparison of F-scores for approximated median, median, and mixture of Gaussians background 
subtraction with a subpixel target in the foreground. 


a stand-alone operation. We did not attempt to modify them to use possible target locations as a conditioning 
variable in the threshold decision. While not tested in this research, it is possible that the estimation method 
and background subtraction method can be built to complement each other with feedback, resulting in better 


background removal as well as better estimation. 


Despite the shortcomings of each of the background subtraction methods, results from the quasi real- 
world data set confirm that background subtraction is a critical component of PMV. For example, the RMS 
error for PMV on the plane sequence without background subtraction is 25.04. This same sequence using 
median background subtraction has a RMS error of 0.3808. It is unclear whether the background subtraction 
method achieves a limit to the amount of improvement possible or if further gains in accuracy can be achieved 
with better background subtraction techniques. Further research in this area is necessary to quantify the 
impact of background subtraction for different frame sequences, and to determine whether feedback from the 


estimator can be leveraged to improve the background subtraction accuracy. 


B. PIXEL DISCRETIZATION 


Figure 9 shows the RMS for p values of 2, 6, 10, 14, and 18 at SNR values of 1, 5, 10, 15, and 20 
dB. Figure 10 shows the affect of each p value on the total runtime. Values shown are average run times with 


variance less than 0.025s over 20 tests for each p value. 
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Figure 9: RMS versus SNR for p € {2,6, 10, 14, 18}. 
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Figure 10: Mean runtime over 20 tests for each value of p € {2,6,10, 14,18}. 
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Figure 11: Mean TDR over 20 tests for each value of p = {2,6,10, 14,18}. 


The effects of the number of discrete pixel positions used in the discretization of the pixel space are 
shown in Figures 9 and 10. As p increases, the estimation accuracy increases up to a value of p = 14. At 
p = 18, the estimation accuracy decreases again. The increase in accuracy is expected. For small values of 
p, fewer subpixel positions are sampled, and the RMS can only decrease to the size of the discretization bin. 
As the number of bins increases with higher p values, the ability to provide more accurate target locations 
translates directly to lower RMS. Given this explanation, however, the expected response to increased p is 
for the RMS to asymptotically approach a lower bound imposed by the sensor noise. Instead, when p = 18, 


we see a decrease in RMS relative to p = 10 and p = 14. 


The reason for the discrepancy in RMS for increasing values of p is unknown. Further experiments 
show an additional increase in RMS for p = 22, followed by a decrease for p = 26 and p = 30, although 
the RMS at p = 30 is still higher than p = 10. A possible explanation is that the higher p values cause the 
probability density to be spread across too many points, resulting in Viterbi arbitrarily breaking ties between 
equal probability states, however, this explanation does not fully account for the increase and subsequent 
decrease in RMS values, only the increase. Give the low RMS value achieved for all p > 10, however, this 


discrepancy is not large enough to warrant further investigation with regard to the best value of p to use. 


Since increases in p result in a p* increase in the total number of states estimated with the Viterbi 


algorithm, the computation time is expected to increase exponentially with increasing p. Figure 12 shows the 
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average run time of PMV over 20 tests at each p value. The increase in run time and decrease in RMS suggest 
that a possible improvement to the PMV method would be to run it in two passes, using the p = 1 value to 
determine the pixels most likely included in the target path, then again with p = 10, performing estimation 
only on those pixels and their neighbors selected in the first pass. This iterative improvement method could 
lead to increased speed with equal accuracy relative to running PMV once with a high p value. For very large 


images, the timing improvement may be significant. 


C. COMPUTATION TIME 


Table 2 shows the average run time in seconds for each set of synthetic image sizes, as well as the quasi 
real-world and real-world sequences. Each run was performed on 2.8 GHz quad-core Intel Xeon processor 
with 24GB of 800 MHz DDR2 RAM with nominal operating system processes running in the background. 
The same information is shown in Figure 12 with the x-axis set to number of frames times number of pixels. 


This format helps to emphasize the linear trends of both methods as pixel count increases. 


The run times in Table 2 and Figure 12 show the linear time complexity of PMV with respect to 
number of frames and number of pixels. The time complexity of TBD is also linear with respect to number 
of particles, however, as seen in Figure 12, the running time is affected by other factors as well. The main 
reason for the variability in run times for TBD is in the jump Markov model used by the filter. As the number 
of particles with existence state set to true increases, the number of motion model evaluations increases. 
This results in two different complexity models with different constant values associated with them, however, 
both are linear with respect to the number of particles used. The car data set results in a large increase in 
computation time because TBD is attempting to track a number of targets simultaneously with greater than 
90% of the population in the existence model for the majority of the run. A similar characteristic can be seen 
in the increase in running time for the real-world data set. In both cases, greater than 90% of the particle 
population is in the existence model, while the other data sets typically average at 75% of the population in 
this model. While this means scenes that result in high probability of detection for TBD will require more 
computation time, the overall complexity for TBD is still linear with respect to the number of particles, and 


lower than PMV. 


D. SYNTHETIC IMAGERY 


Figure 13 shows the mean RMS over 20 runs of synthetic data for each image size. The horizontal 
dotted line indicates the RMS at which estimation achieves subpixel accuracy. Figure 14 shows the average 
RMS estimate for only those points that fall within the ground truth bounding box region for each ground 


truth position for both TBD and PMV on the three synthetic data sets. Figure 15 shows the average tracker 
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Image Size | # Frames | Mean Run Time (s) | Std of Run Time (s) 

30x30 35 17.99 2.49 

Synthetic 85x85 55 58.38 1.78 

200x200 55 403.77 8.34 
PMY Plane 80x60 491 422.49 - 
Car 32x51 301 76.92 - 
Real-World 50x41 106 34.67 - 

30x30 55 1.30 0.05 

Synthetic 85x85 55 1.94 0.07 

200x200 ae) 6.39 0.61 
zee Plane 80x60 491 15.42 - 
Car 32x51 301 39.60 - 
Real-World 50x41 106 3.57 - 





























Table 2: Run times for PMV and TBD for synthetic, quasi real-world, and real-world data sets. Mean run times 
are reported for synthetic imagery over 20 separate tests, quasi real-world and real-world report actual 
run time for a single instance. 
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Figure 12: Run times for TBD and PMV for synthetic, quasi real-world, and real-world data sets. Mean run 
times are reported for synthetic imagery over 20 separate tests, quasi real-world and real-world report 
actual run time for a single instance . 
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Figure 13: Mean RMS over 20 test runs from 1dB to 20dB SNR for PMV and TBD. 
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Figure 14: Mean RMS—TP over 20 test runs from 1dB to 20dB SNR. 
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Figure 15: Mean TP over 20 test runs from 1dB to 20dB SNR. 


detection rate for TBD and PMV for each of the three synthetic data sets based on the percentage of estimated 


points that fall within the bounding box of a ground truth point. 


Using the RMS measure of performance, PMV outperforms TBD in every synthetic data set for SNRs 
greater than or equal to 4dB. At SNR values from | to 4dB, the methods are relatively comparable, however, 
both methods fail to achieve subpixel estimation error. The tracker detection rate for PMV is greater than 
80% for SNR of 3dB and greater, with TBD producing greater than 80% true positive rate starting at 7dB for 
30x30 images, and 12dB for 85x85. TBD fails to achieve greater than 80% detection rate for the 200x200 
data set at any tested SNR. Using the RMS—TP measure, PMV outperforms TBD on all synthetic data sets. 


The image size has relatively little impact on the quality of the PMV solution. For all three image 
sizes, PMV achieves better than 80% tracker detection rate at SNR of 3dB or greater. As shown in Figures 
13 and 15, the RMS and detection rate are closely clustered across image sizes. In relation to TBD, which 
is image size dependent, PMV does not appear to reduce quality with image size. However, as expected, the 


linear growth of PMV with respect to the number of pixels results in much longer computation time. 


The computational load of PMV can be partially addressed through parallelization techniques. Pro- 
filing shows that the distance transform (DT) method accounts for a total of 37% of the total run-time. The 
manner in which this method iterates over each possible motion parabola in the set of possible motions makes 
it difficult to parallelize, however, it is possible to perform parallel instances of DT on spatially segmented 


portions of the image using a MapReduce paradigm [86]. The Map function consists of mapping spatially 
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segmented sets of parabolas to processing units in order to solve the DT for those units. The Reduce function 
calculates the minimum of the mapped results. This requires that each segment overlaps it’s neighboring 
segments by the maximum expected target velocity. If target velocity cannot be limited in practice, the paral- 
lelization may not be possible, however, for many targets, the maximum velocity is limited by sensor frame 


rate and target kinematics. 


The RMS—TP shown in Figure 14 shows the best performance of each method when all outliers that 
are not within a | pixel bounding box of the target location are excluded. The TBD methods do not extend 
all the way to 1dB SNR due to poor tracking performance resulting in no true positive values to plot. When 
only using true positive values, PMV outperforms TBD for all synthetic data sets. This shows that, even 
under ideal circumstances, the variance of the TBD method is greater than the PMV method. The most likely 
reason for this is that TBD relies on a population of particle hypothesis that are adjusted at each time step. 
Each adjustment due to the motion model injects variation to the TBD estimate. This limits the lower bound 
performance of TBD. PMV, on the other hand, selects the mode of the motion model at each step. While 
noise can disturb this estimate, the sequence of states serves to damp out noise spikes. Additionally, the entire 
maximum log-likelihood sequence is stored for each time step. This means that poor estimates made early in 
the sequence can be discarded for sequence estimates that result in better log-likelihood in later observations. 
With no variance injection in the estimate and memory of previous estimates that may have resulted in good 


solutions, PMV is able to produce lower variance and more consistent estimates than TBD. 


Appendix B shows the estimated paths for each of the synthetic image sizes and SNR combinations. 
In the TBD examples, the paths tend to start in the middle of the image before converging to the target path. 
The reason for this is that TBD begins to detect the target before it can develop a position estimate. Since 
the position estimate is based on the location of each particle in the existence set, the population is often 
dispersed throughout the image when the detection threshold is first met. This causes large initial errors 
in the path estimation, however, the RMS—TP shows that even when the estimator has converged with the 
target path, the estimation error continues to be larger than PMV. Increasing the detection threshold to force 
convergence on the target before registering target detection caused TBD to fail to detect the target at lower 
SNR values. Figures 21 and 20 show the results of varying the detection threshold for TBD. All results 
shown here use a threshold of 0.7 which achieved the maximum tracker detection rate (TDR) and minimum 


RMS across the most SNR values. 


The results for RMS—LR, discussed in Section A are not presented. In every instance, the RMS— 
LR values were the same as RMS. This indicates that neither estimation method has a consistent path-wise 


estimation bias. 
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Figure 16: Three frames showing the PMV estimated paths for the “plane” and “car” quasi real-world data set. 
A box is drawn over the target in each frame, and the current path estimate is drawn as a white 
line. The top row contains frames 1, 245, and 491 from the plane data set. The second row contains 
frames 1, 150, and 301 from the car data set. All images have been contrast enhanced for clearer 
viewing. 


E. QUASI REAL-WORLD 


The PMV RMS for the car data set is 1.89, and the plane data set is 0.36 with detection rate of 0.66 
and 1.0 respectively. RMS for the car data set and the plane data set cannot be calculated for the TBD filter 
because it did not surpass the detection threshold of 0.7 in either case. Further reductions in the threshold 
down to 0.5 did not improve this result, while thresholds below 0.5 resulted in large clusters of particles 
throughout the image with cluster centers mostly tracking background motion in the scene. Figure 16 shows 
the first, middle and last frames from each data set with a square centered on the actual target position, and 
the PMV estimate at the current frame. Figure 17 shows the same three frames with the TBD estimated 


distribution overlaid as red dots on the image state space. 


The quasi real-world imagery data set highlights two advantages of PMV over TBD: moving back- 
ground suppression, and multi-target suppression. The plane data set exhibits a moving background that 
causes the TBD method to fail, while the car data set contains multiple targets, which violates the single 
target assumption used for both filters. While PMV properly rejects the moving background components 
of the plane scene, TBD is unable to converge on the target due to that movement. In the car scene, PMV 


produces a single track that produces the highest overall likelihood. In this data set, target paths come into 
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Figure 17: Three frames showing the TBD estimated paths for the “plane” and “car” quasi real-world data set. 
A box is drawn over the target in each frame, and the current path estimate is drawn as a white 
line. The top row contains frames 1, 245, and 491 from the plane data set. The second row contains 
frames 1, 150, and 301 from the car data set. The red dots indicate the particle distribution of the 
TBD filter at the current frame. All images have been contrast enhanced for clearer viewing. 
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close proximity, resulting in PMV estimating a path that spans three separate targets. TBD on the other hand 
spreads particles to each of the targets in the scene, with no target producing a high enough likelihood to 
result in convergence of the population on a single mode. This produces no path estimate, and the resulting 


clusters of particles are not clear enough to make any determination of target locations or directions. 


In the plane data set, a layer of clouds covers the bottom half of the scene. These clouds are moving 
slowly, and have a texture that produces high subpixel target likelihoods. Using the median background 
subtraction filter removes much of the cloud bank from the image, but it does not remove the parts of the 
clouds that are changing through the scene. Despite this, PMV tracks the plane starting in the first frame of 
the data set. TBD, on the other hand, converges to the target location within the first five frames, but then 
disperses across the cloud bank. The remainder of the data set, TBD particles are spread throughout the cloud 
bank with no particles tracking the airplane. This behavior is due to the maximum likelihood (ML) nature of 
TBD. When an ML estimate of target location is made in the cloud bank, the majority of particles that were 
tracking the target are lost in the resampling step. After several time steps, any particles that were tracking the 
correct target have been lost. In order to reacquire the target, TBD must randomly place a new particle near 
enough to the target to have a high enough likelihood to be selected more frequently. With high likelihood in 


the clouds, this does not occur. 


Since the PMV method samples every pixel, even if higher transient likelihoods occur over the course 
of multiple frames, unless those transient likelihoods are very close to the actual target, they do not affect 
the maximum path that represents the true target motion. Due to this dense estimate, high likelihoods are not 
“missed” as they can be with TBD. Additionally, since the estimate is based on the maximum prior path for 
each pixel, unless noise or background motion occurs with higher likelihood very near to the target, the target 
path is preserved once the noise or motion has ceased to provide high likelihoods. This means that PMV is 


more easily able to reject temporary motion of background objects. 


FE. REAL-WORLD 


Figure 18 shows the first, middle and last frames from the data set with a square centered on the actual 
target position, and the PMV estimate at the current frame on the top row. The bottom row of Figure 18 


shows the same three frames with the TBD estimated path as a white line. 


The real-world data set provides a comparison between PMV and TBD on a known subpixel target. 
Manual estimation of ground truth data for this set resulted in a highly varying path that is less accurate 
than both PMV and TBD; therefore, we compare the estimated paths to each other to provide insight into 
the relative performance of the two methods. Figure 18 shows the estimated paths for the target for various 


frames. Figure 19 shows a close-up comparison of both methods’ estimates. The path estimated by PMV is 
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Figure 18: Three frames for the real-world data set. The top row shows the PMV estimated path up to the 
current frame for frames 47, 87, and 126 with a rectangle placed at the current estimated location. 
The bottom row shows the TBD estimated path up to the current frame for frames 47, 87, and 126. A 
rectangle is placed over the current estimate if TBD has passed the detection threshold for the frame 


(frames 47 and 126 do not have rectangles because TBD has fewer than 70% of particles tracking the 
target. 
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Figure 19: Close-up of the PMV estimated path and TBD estimated path in the real-world data set. 


smooth, and follows the correct straight line motion of the emitter. The path estimated by TBD on the other 
hand has high variance relative to the straight line path. Both paths closely track the emitter’s correct position 


in the data set, however the high variability of TBD indicates that it does not track the target as accurately. 


The smoothness of the PMV estimate is due to the use of the entire path in estimation. Low probability 
estimates at the beginning of the sequence may still be included in the final solution. For TBD, on the other 
hand, there is no mechanism for keeping track of prior path estimates when the filter has not converged. As 
seen in the path plots in Appendix B, the estimate tends to be very close to the center of the image until 
the filter has converged over the course of several frames. As a consequence of this behavior, any estimates 
made prior to convergence will result in very large error. As discussed in Section D the variability of the 
TBD population after convergence results in a shifting mean. While the mean stays close to the actual target 


location, the fluctuation produces poorer results than the maximum a posteriori probability (MAP) result. 


While the low noise characteristics of the sensor used to the capture the real-world data set preclude 
testing at lower SNR values, the observed results are consistent with the predicted results from the synthetic 
data set. At a peak SNR of 20dB, both methods are expected to produce subpixel position estimates with 
PMV providing a better estimate. As seen in Figure 19, the paths produced by each method are very close to 


each other, with the smoother path of the PMV more closely matching the actual motion of the target. 


G. CONCLUSION 


This chapter presented the results of for the experiments outlined in Chapter TV. Background subtrac- 
tion precision and recall curves were presented and the median method is shown to provide the best overall 
performance in the subpixel domain. The RMS error of PMV for varying values of p was shown, with p = 10 
chosen as the best compromise between computation time and estimation quality. The computation time of 
PMV and TBD were compared to show that expected linear growth rate of PMV with respect to the image 
pixel count and number of frames, while TBD showed an increase in computation time for increasing number 
of frames. Estimation results for synthetic, quasi real-world, and real-word data sets were shown, with PMV 


producing lower RMS error and higher TDR than TBD on each of the data sets. 
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VI. CONCLUSION 


This dissertation presents a method for tracking subpixel targets, called pixel-matched Viterbi (PMV). 
A likelihood function is specified using a template matched filtering and templates generated using the sen- 
sor’s point spread function (PSF) to determine the likelihood of subpixel targets in a single image. For any 
motion model featuring additive Gaussian noise, the likelihood and motion model are combined to calculate 
the maximum a posteriori probability (MAP) of target positions over a given sequence of images using the 
Viterbi algorithm and distance transform function. The final result consists of the most likely path of the 


target through the sequence of images. 


We compared PMV to a current state of the art tracking method, the track before detect (TBD) 
particle filter. With respect to mean root mean squared (RMS) error over a series of tests, the estimated 
path is superior to TBD for images down to 4dB signal-to-noise ratio (SNR) and equivalent to TBD below 
this level. With regards to number of estimated path points within | pixel of the actual target position, the 
proposed method achieves greater than 80% detection rate with SNR as low as 3dB versus 7dB for TBD. 


Unlike TBD, the mean RMS is shown to remain consistently low regardless of image size. 


Finally, we discussed shortcomings of the presented method such as the need for better background 
subtraction in the presence of dynamic background motion, and to assume target presence in the image se- 
quence. For backgrounds with many dynamic elements, and image sequences with larger than subpixel 
objects, the background motion or texture on the larger object causes the tracker to follow the wrong tar- 
get. While background subtraction partially addresses the dynamic background problem, better methods are 
needed to further improve the solution. For moving foreground objects that are larger than a pixel, the missed 
estimation typically occurs due to subpixel texture on the target. Methods to reject or discount larger targets, 


or to track multiple subpixel targets simultaneously may help alleviate this problem. 


The implications of this work can be applied in a number of problem areas such as maritime moni- 
toring and tracking from space, unmanned aerial vehicle (UAV) sense and avoid, stealth aircraft tracking, and 
many other applications where small targets cannot be easily resolved by an optical sensor. This method al- 
lows planners to reduce sensor pixel count without compromising the ability to track certain classes of targets 
due to size. Additionally, it allows for analysis of low resolution, coarse imagery rather than requiring higher 
resolution, allowing for bandwidth savings. For applications such as sense and avoid and infrared search and 
track (IRST) this method has the potential to provide earlier warning for incoming targets than is currently 


available. This additional warning window allows for more time to take appropriate countermeasures. 
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A. FUTURE WORK 


While the results show that PMV performs better than the TBD method, they also indicate areas 
for further research. As discussed in Chapter II, Section D the type of scene dictates the best background 
subtraction method to use. Additionally, the background subtraction methods tested were not developed to 
use any information from the subpixel domain. An improved method that is tightly integrated with the MAP 
calculations has the potential to increase the performance of the tracker by eliminating all background-related 


subpixel textures. 


Another area of research that will benefit the PMV method is the ability to perform a detection deci- 
sion. Currently, PMV assumes that a target is always present in the input data. If no target exists, PMV will 
return the sequence of states that best describes the noise. While this path is assigned a probability value by 
the MAP estimator, a better method would determine when a target appears in the data, and when it disap- 
pears from the data. Using a detection decision would allow PMV to run on large video data sets with no 


need to determine whether or not a subpixel target exists in any of the frames. 


The performance of PMV precludes its use in real-time imagery. As suggested in Chapter IV, Sec- 
tion C and Chapter V, Section C, a number of techniques may reduce overall processing without reducing 
estimation quality. For example, adaptive discretization has the potential to reduce the total number of states 
considered by the Viterbi algorithm. Additionally, parallelization may be employed to increase the frame 


rate. Through use of parallelization, we believe that PMV can be made to run on live video. 


In addition to making improvements on PMV, a number of application areas exist that have the poten- 
tial to provide useful testing corpora. These will likely introduce scenes with features that were not considered 
or tested in this work, possibly highlighting strengths and weaknesses not yet identified. One such corpus is 
the asteroid tracking database collected by the Lowell Observatory as part of their Near-Earth-Object Search 
(LONEOS). This data set contains thousands of image sequences taken by earth-based telescopes over fixed 
time intervals. The goal of the LONEOS project is to detect small asteroids in these sequences. Currently 
this is accomplished through crowd-sourcing methods with individuals looking for minute changes between 


images. Using PMV, we believe this search can be automated over the LONEOS data set. 


Other possible improvements to PMV are to extend it for multiple target detection, and include camera 
ego-motion in the estimation. These improvements would allow for an expanded set of applications to include 
aerial target tracking and acquisition. While a number of multi-target methods exist in the literature, none of 
these methods have been attempted on subpixel targets. These methods are tightly coupled with the detection 
and estimation steps of the tracking method used. While not trivial, we believe the same techniques can be 


applied to PMV. 
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B. CONCLUSION 


The primary thesis of this dissertation is that subpixel tracking for optical sensors is possible, and can 
be accomplished with a similar level of accuracy as current state of the art methods for multipixel targets. 
Experimental results support this thesis and indicate that even greater accuracy than current state of the art 
methods is possible. While the proposed method is more computationally intensive that the current state of 
the art, advances in multicore architectures, vector processing, and embedded processing offer the necessary 
level of performance to use this method on real-world data. The subpixel tracking method is the first such 
method the author is aware of in the literature and uniquely fills a gap in capability for the tracking and 


computer vision communities. 
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APPENDIX 


Figures 20 and 21 show the estimation root mean squared (RMS) error and tracker detection rate 
(TDR) for the TBD method for varying detection thresholds. The authors of TBD recommend a threshold of 
0.6 as the detection decision, however the figures show that, for the subpixel problem set, better results are 


achieved at the 0.7 thresholds. The value of 0.7 was used for all comparison data runs in Chapter V. 


Figures 22, 23, and 24 show the TBD estimated paths for 20 separate observation sequences using 
the same ground truth motion. The ground truth is overlaid on these in red. These plots show the type of 
estimation error that can be expected as signal-to-noise ratio (SNR) decreases. Many of the paths begin in 
the center of the image, then converge to the ground truth path. This is a consequence of the maximum 
likelihood (ML) estimation technique which produces higher variance estimates in the beginning, with lower 
variance as more observations are made. The high variance estimates tend towards the center of the image 
because the filter has exceeded the threshold limit for detection before the variance has reduced enough to 


produce accurate estimates. 


Figures 25, 26, and 27 show the PMV estimated paths for 20 separate observation sequences using 
the same ground truth motion. The ground truth is overlaid on these in red. Many of the estimated paths down 
to SNR of 1dB track the ground truth path accurately, however, when PMV fails to track the path, the result 
is a path that follows a weighted random walk where the motion model weights the walk towards a path that 


more closely resembles the model’s specification. 
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Figure 20: TBD RMS for threshold values from 0 to 1 at 0.1 intervals. 
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Figure 21: TBD TDR for threshold values from 0 to 1 at 0.1 intervals. 
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Figure 23: TBD estimated paths for 85 x 85 images from 1dB to 20dB. 
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Figure 24: TBD estimated paths for 200 x 200 images from 1dB to 20aB. 
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Figure 25: PMV estimated paths for 30 x 30 images from 1dB to 20dB. 
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Figure 27: PMV estimated paths for 200 x 200 images from 1dB to 20dB. 
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